CSSS/SOC/STAT 321: Lab 1

Introduction to R/RStudio

Tao Lin

Introduction

Agenda

  • Introduction

  • R and RStudio

  • QSS Exercise

Logistics

  • Labs:
    • AC: Wed & Fri 11:30 am - 12:20 pm, Denny 213
    • AD: Wed & Fri 12:30 pm - 1:20 pm, Thompson 334
  • Office Hours:
    • Fri 1:30 pm - 3:30 pm or by appointment, Smith 35
  • The Goal of Labs
    • Review Course Materials, esp. readings and assignments.
    • Develop Coding Skills and Practices in R.
  • Lab Materials are published at https://github.com/soxv/CSSS-321-Labs.
    • I also post slides on Canvas, but will sometimes update previous slides in the GitHub repo.

Important Deadlines

  • QSS Tutorials: due every Tuesday
  • Problem Sets
    • Problem Set 1 (Randomized Experiments): April 16
    • Problem Set 2 (Summarizing Data): April 30
    • Problem Set 3 (Regression): May 21
    • Problem Set 4 (Inference): June 4
  • Midterm: May 12
  • Final Project
    • Proposals: April 21
    • (Initial) Analyses: May 19
    • Final Reports: June 7

How to Look for Help?

  • Use help() or ? to check out function documentation.
  • Search for the error message online; answers are often on Stack Overflow or GitHub!
  • New options in the AI era: ChatGPT or New Bing!
  • We use the Canvas discussion board for Q&A and troubleshooting.
    • People encountering similar problems can see how to solve them (avoid “reinventing the wheel”).
    • We encourage you to help your peers on the discussion board.
    • If you have questions that cannot be covered by one single post, please come to my office hours.
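
As a small illustration of the first bullet, the sketch below shows a few standard base R ways to look things up. The choice of read.csv as the example function is arbitrary.

```r
# Interactive lookups (run these in the console; RStudio shows them in the Help pane)
# help("read.csv")
# ?read.csv
# ??"standard deviation"   # full-text search across installed help pages

# Ways to inspect a function without opening a help page
args(read.csv)                          # prints the function signature
arg_names <- names(formals(read.csv))   # argument names as a character vector
matches <- apropos("^read\\.")          # functions whose names start with "read."
print(matches)
```

Checking the argument names first often explains an error message before any searching is needed.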

Minimal Reproducible Example (MRE)
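
The slide above gives only the title, so here is a minimal sketch of what an MRE posted to the discussion board might look like. The data frame df and the misspelled column are made up for illustration; the three parts (small data, shortest failing code, session details) are the usual ingredients.

```r
# 1. Data: a tiny data frame that anyone can recreate by copy-pasting
df <- data.frame(year = c(1980, 1982, 1984), ANES = c(71, 60, 74))
dput(df)  # prints code that rebuilds `df`; paste that output into your post

# 2. Code: the shortest snippet that still triggers the problem
#    (here the column name is misspelled: `anes` instead of `ANES`)
problem <- tryCatch(
  mean(df$anes),
  warning = function(w) conditionMessage(w)  # capture the warning text
)
print(problem)

# 3. Environment: R version and loaded packages
sessionInfo()
```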

Introduce Yourself!

  • What is your major / year?
  • Why are you taking this course?
  • What is your experience with data science and R?

R and RStudio

What is R?

  • A free and open-source language for statistical computing and graphics.
  • How to install R from CRAN (The Comprehensive R Archive Network).
    • Mac:
      • choose “R-4.2.3-arm64.pkg” for Apple silicon Macs (M1 or later)
      • otherwise choose “R-4.2.3.pkg”.
      • also install the Command Line Tools by typing xcode-select --install in your terminal if you haven’t done so. It helps compile some R packages that rely on other languages such as C++ or Fortran.
    • Windows:
      • click “install R for the first time”
      • also click “Rtools” and install it. Rtools helps compile some R packages that rely on other languages such as C++ or Fortran.
  • Two components:
    • R console
      • run commands in it to generate the corresponding output.
      • analogy: musical instrument 🎻
    • R script:
      • records commands in plain text, which makes it easier to share your code and for other people to reproduce your results.
      • analogy: written sheet music with notations 🎼

What is RStudio?

  • An integrated development environment (IDE) for R. It includes:
    • R console
    • syntax-highlighting editor
    • tools for plotting, debugging, and workspace management
  • TL;DR - RStudio provides various tools that make R programming easier.
  • Install RStudio from Posit.

RStudio Setup

  • Change Appearance and Pane Layout: Tools > Global Options... > Appearance/Pane Layout
  • Don’t save workspace to .RData on exit: Tools > Global Options... > General > “Save workspace to .RData on exit” > “Never”
  • Don’t restore .RData into workspace: Tools > Global Options... > General > uncheck “Restore .RData into workspace at startup”
    • Reloading a saved workspace may be convenient for you, but it makes your code less reproducible on other people’s machines.

Install Packages

  • What is a package?
    • a collection of R functions, compiled code and datasets for reuse.
    • Important packages for this course:
      • tidyverse: a bundle of packages for data wrangling.
      • rmarkdown: write documents that embed R code as well as its output.
      • qsslearnr: interactive tutorials for Quantitative Social Science
  • Two actions
    • Installing a package: download the package to your computer.
    • Loading a package: tell R to use the package.
    • You only need to install a package once, but you need to load it in every new R session.
  • Install packages from different sources
    • install.packages("tidyverse") downloads from CRAN by default.
    • remotes::install_github("rstudio/learnr") downloads from GitHub.
    • Packages on CRAN must pass the checks and submission policies enforced by the CRAN maintainers; packages hosted on GitHub are maintained by individual developers or small teams and may not go through the same level of testing and quality control.
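
The difference between installing and loading can be sketched as follows. The package name here is a stand-in: splines ships with every R installation, so the install step is skipped; for the course you would substitute "tidyverse" or another package.

```r
pkg <- "splines"  # stand-in that ships with R; replace with "tidyverse" etc.

# Install only if the package is not already on this computer (one-time step)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it in every new R session before using its functions
library(pkg, character.only = TRUE)
is_attached <- paste0("package:", pkg) %in% search()
print(is_attached)
```

The character.only = TRUE argument is needed only because the package name is stored in a variable; in everyday use library(tidyverse) is enough.
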
## Interactive Tutorials for Quantitative Social Science
## Written by Matthew Blackwell 
## See here: https://github.com/mattblackwell/qsslearnr

## 1. Install the `remotes` package by running:
install.packages("remotes")

## 2. Install the following packages by running:
remotes::install_github("kosukeimai/qss-package", build_vignettes = TRUE)
remotes::install_github("rstudio/learnr")
remotes::install_github("rstudio-education/gradethis")
remotes::install_github("mattblackwell/qsslearnr")

## 3. See all available tutorials for QSS
learnr::run_tutorial(package = "qsslearnr")

## 4. Run a particular tutorial
learnr::run_tutorial("00-intro", package = "qsslearnr")

## 5. If you have problems generating PDF from Rmarkdown
## install tinytex by running (takes some time!): 
# install.packages("tinytex")
# tinytex::install_tinytex()
  • For Mac users: installing the qss package sometimes fails because pandoc or curl is missing or out of date on your Mac (if you don’t encounter this problem, you can skip this).
    • pandoc converts documents to other formats, e.g. .html to .pdf or .docx.
    • curl is used to transfer data through URLs.
    • To install or upgrade pandoc or curl, first install the package manager Homebrew, then run brew install pandoc or brew install curl in the Mac Terminal.

QSS Exercise

QSS Tutorial 0

Any questions?

Bias in Self-Reported Turnout

  • Use read.csv() to load the voter turnout data
    • If your datasets are stored in other formats, such as .xlsx, .sav or .dta, you need external packages such as readxl, foreign or haven to help you load your datasets.
  • File management practices in R
    • Option 1: Use setwd() to set the target folder as the current working directory in R.
    • Option 2: Open your target folder as an R project.
  • In this exercise, we will also explore how to visualize data in R in two different ways:
    • Base R graphics: requires more code to generate plots, but is more flexible.
    • ggplot2: requires less code to generate plots, but is more restrictive, e.g. we need to transform the data before plotting.
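
A minimal sketch of Option 1, assuming a throwaway folder: the folder name lab1-demo and the file demo.csv are hypothetical and are created in a temporary directory so that the example runs anywhere.

```r
# Build a small folder that mimics a lab folder with a data/ subfolder
project_dir <- file.path(tempdir(), "lab1-demo")  # hypothetical project folder
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
demo <- data.frame(year = c(1980, 1982), ANES = c(71, 60))
write.csv(demo, file.path(project_dir, "data", "demo.csv"), row.names = FALSE)

# Option 1: point R at the folder, then use paths relative to it
old_wd <- setwd(project_dir)   # setwd() returns the previous working directory
loaded <- read.csv("./data/demo.csv")
setwd(old_wd)                  # restore the original working directory
print(loaded)
```

With Option 2, RStudio sets the working directory to the project folder when the project is opened, so the same relative path works without calling setwd().
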
Variable   Description
year       election year
ANES       ANES estimated turnout rate (%)
VEP        voting eligible population (in thousands)
VAP        voting age population (in thousands)
total      total ballots cast for highest office (in thousands)
felons     total ineligible felons (in thousands)
noncit     total noncitizens (in thousands)
overseas   total eligible overseas voters (in thousands)
osvoters   total ballots counted by overseas voters (in thousands)
  1. Load the data into R and check the dimensions of the data. Also, obtain a summary of the data. How many observations are there? What is the range of years covered in this data set?
turnout <- read.csv("./data/turnout.csv") # load the dataset as a data.frame in R

dim(turnout) # the dimensions of the dataset: 14 rows (observations) x 9 columns (variables)
[1] 14  9
# we can also use `nrow()` to solely fetch the number of rows
# and `ncol()` to solely fetch the number of columns
head(turnout, n = 5) # the first 5 rows of the dataset
  year    VEP    VAP total ANES felons noncit overseas osvoters
1 1980 159635 164445 86515   71    802   5756     1803       NA
2 1982 160467 166028 67616   60    960   6641     1982       NA
3 1984 167702 173995 92653   74   1165   7482     2361       NA
4 1986 170396 177922 64991   53   1367   8362     2216       NA
5 1988 173579 181955 91595   70   1594   9280     2257       NA
summary(turnout) # get the range and quartiles of each variable
      year           VEP              VAP             total       
 Min.   :1980   Min.   :159635   Min.   :164445   Min.   : 64991  
 1st Qu.:1986   1st Qu.:171192   1st Qu.:178930   1st Qu.: 73179  
 Median :1993   Median :181140   Median :193018   Median : 89055  
 Mean   :1993   Mean   :182640   Mean   :194226   Mean   : 89778  
 3rd Qu.:2000   3rd Qu.:193353   3rd Qu.:209296   3rd Qu.:102370  
 Max.   :2008   Max.   :213314   Max.   :230872   Max.   :131304  
                                                                  
      ANES           felons         noncit         overseas       osvoters  
 Min.   :47.00   Min.   : 802   Min.   : 5756   Min.   :1803   Min.   :263  
 1st Qu.:57.00   1st Qu.:1424   1st Qu.: 8592   1st Qu.:2236   1st Qu.:263  
 Median :70.50   Median :2312   Median :11972   Median :2458   Median :263  
 Mean   :65.79   Mean   :2177   Mean   :12229   Mean   :2746   Mean   :263  
 3rd Qu.:73.75   3rd Qu.:3042   3rd Qu.:15910   3rd Qu.:2937   3rd Qu.:263  
 Max.   :78.00   Max.   :3168   Max.   :19392   Max.   :4972   Max.   :263  
                                                               NA's   :13   
turnout$year # we use `$` to get specific variables from the data.frame; this is a vector of year
 [1] 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2008
# alternatively, we can use `turnout[, "year"]` to do the same thing.
length(turnout$year) # the length of `year` vector is 14.
[1] 14
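
To answer the question about the range of years directly, one possible sketch works on the year vector alone. The vector is retyped from the output above so that the snippet runs without the CSV file; with the data loaded you would use turnout$year instead.

```r
# Election years as printed by `turnout$year`
year <- c(1980, 1982, 1984, 1986, 1988, 1990, 1992,
          1994, 1996, 1998, 2000, 2002, 2004, 2008)

n_obs <- length(year)        # number of observations
year_range <- range(year)    # earliest and latest election year

# Which biennial election years between the two endpoints are absent?
expected <- seq(min(year), max(year), by = 2)
missing_years <- setdiff(expected, year)

print(n_obs)
print(year_range)
print(missing_years)
```
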
  2. Calculate the turnout rate based on the voting age population or VAP. Note that for this data set, we must add the total number of eligible overseas voters since the VAP variable does not include these individuals in the count. Next, calculate the turnout rate using the voting eligible population or VEP. What difference do you observe? (Additionally, how can we visualize the temporal change in this difference?)
turnout$tr_vap <- turnout$total / (turnout$VAP + turnout$overseas) * 100
turnout$tr_vep <- turnout$total / turnout$VEP * 100
turnout$tr_vep - turnout$tr_vap
 [1] 2.155785 1.891789 2.711115 2.062703 3.045878 2.480105 4.072866 3.095397
 [9] 4.124166 3.261470 4.882388 3.682145 5.553078 5.880239
plot(
  tr_vep - tr_vap ~ year, # variable on y-axis ~ variable on x-axis
  data = turnout, # the dataset from which we extract variables for plotting
  type = "l", # the type of plot; we use lines to visualize the time series 
  xlab = "Year", ylab = "VEP-based TR - VAP-based TR (%)", # labels of x- and y-axis
  xaxt = "n" # suppress the default x-axis so that we can draw our own below
)
axis(
  side = 1, # 1 = bottom (x) axis
  at = seq(1980, 2008, 4) # tick marks every four years, i.e. presidential election years
)
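
To make the arithmetic behind the two turnout rates concrete, the sketch below repeats the calculation on the first five rows only. The numbers are retyped from the head(turnout, n = 5) output above; the name first5 is just an illustrative stand-in for the full data frame.

```r
# First five rows of the data, copied from the `head()` output
first5 <- data.frame(
  year     = c(1980, 1982, 1984, 1986, 1988),
  VEP      = c(159635, 160467, 167702, 170396, 173579),
  VAP      = c(164445, 166028, 173995, 177922, 181955),
  total    = c(86515, 67616, 92653, 64991, 91595),
  overseas = c(1803, 1982, 2361, 2216, 2257)
)

# VAP-based rate: overseas voters are added to the denominator
first5$tr_vap <- first5$total / (first5$VAP + first5$overseas) * 100
# VEP-based rate: ballots cast divided by the eligible population
first5$tr_vep <- first5$total / first5$VEP * 100
# Gap between the two rates, in percentage points
first5$gap <- first5$tr_vep - first5$tr_vap

print(round(first5[, c("year", "tr_vap", "tr_vep", "gap")], 2))
```

The VEP-based rate is higher in every row because its denominator excludes people who cannot vote.
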
  3. Compute the differences between the VAP and ANES estimates of turnout rate. How big is the difference on average? What is the range of the differences? Conduct the same comparison for the VEP and ANES estimates of voter turnout. Briefly comment on the results.
diff_vap <- turnout$ANES - turnout$tr_vap
summary(diff_vap)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  11.06   18.22   20.62   20.33   22.42   26.17 
diff_vep <- turnout$ANES - turnout$tr_vep
summary(diff_vep)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.581  15.267  16.893  16.836  18.529  22.489 
  4. Compare the VEP turnout rate with the ANES turnout rate separately for presidential elections and midterm elections. Note that the data set excludes the year 2006. Does the bias of the ANES estimates vary across election types?
# presidential elections fall in years divisible by 4; every other election year in this data set is a midterm
turnout$midterm <- ifelse(turnout$year %% 4 != 0, 1, 0) # 1 = midterm, 0 = presidential
turnout$tr_vep[turnout$midterm == 0]
[1] 54.19551 55.24860 52.76848 58.11384 51.65793 54.22449 60.10084 61.55433
turnout$tr_vep[turnout$midterm == 1]
[1] 42.13701 38.14115 38.41895 41.12625 38.09316 39.51064
mean(turnout$tr_vep[turnout$midterm == 0]) - mean(turnout$tr_vep[turnout$midterm == 1])
[1] 16.41181
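
The code above compares VEP turnout itself across election types. Since the question asks about the bias of ANES, one possible complement is to average the ANES-minus-VEP gap within each type. This is a sketch rather than the official solution: the year and diff_vep values are retyped from the printed output in this lab so that it runs without the CSV file, and the names election and bias_by_type are introduced here.

```r
# Election years and ANES minus VEP-based turnout, copied from the printed output
year <- c(1980, 1982, 1984, 1986, 1988, 1990, 1992,
          1994, 1996, 1998, 2000, 2002, 2004, 2008)
diff_vep <- c(16.804491, 17.862987, 18.751404, 14.858846, 17.231520,
              8.581054, 16.886160, 14.87375, 21.34207, 13.90684,
              18.77551, 22.48936, 16.89916, 16.44567)

# Label each year by election type, then average the gap within each type
election <- ifelse(year %% 4 == 0, "presidential", "midterm")
bias_by_type <- tapply(diff_vep, election, mean)

print(round(bias_by_type, 2))
```

With the full data frame loaded, the same idea would be tapply(turnout$ANES - turnout$tr_vep, turnout$midterm, mean).
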
  5. Divide the data into half by election years such that you subset the data into two periods. Calculate the difference between the VEP turnout rate and the ANES turnout rate separately for each year within each period. Has the bias of ANES increased over time?
diff_vep[1:7] # first period (1980-1992): ANES minus VEP-based turnout
[1] 16.804491 17.862987 18.751404 14.858846 17.231520  8.581054 16.886160
diff_vep[8:14] # second period (1994-2008)
[1] 14.87375 21.34207 13.90684 18.77551 22.48936 16.89916 16.44567
mean(diff_vep[8:14]) - mean(diff_vep[1:7])
[1] 1.965126
  6. ANES does not interview prisoners and overseas voters. Calculate an adjustment to the 2008 VAP turnout rate. Begin by subtracting the total number of ineligible felons and noncitizens from the VAP to calculate an adjusted VAP. Next, calculate an adjusted VAP turnout rate, taking care to subtract the number of overseas ballots counted from the total ballots in 2008. Compare the adjusted VAP turnout with the unadjusted VAP, VEP, and the ANES turnout rate. Briefly discuss the results. (Additionally, how can we visualize the comparison among the 4 types of turnout rate?)
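# `osvoters` (overseas ballots counted) has only one non-missing value, so using it here would leave 13 of the 14 years NA;
# this line uses `overseas` (eligible overseas voters) instead, which is only an approximation of the adjustment the question describes
turnout$adj_tr_vap <- (turnout$total - turnout$overseas) / (turnout$VAP - turnout$felons - turnout$noncit) * 100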

# install.packages("tidyverse")
library(tidyverse)
# the pipe operator `%>%` passes the dataset along, so we do not have to refer to it repeatedly
turnout %>%
  # `pivot_longer()` reshapes data from wide form to long form, which is more convenient for ggplot2; see `vignette("pivot")`
  pivot_longer(
    c("tr_vap", "tr_vep", "ANES", "adj_tr_vap"), # stack these four columns into one variable
    names_to = "type", values_to = "turnout_rate"
  ) %>%
  ggplot(aes(x = year, y = turnout_rate, group = type), data = .) +
    geom_line(aes(color = type)) +
    scale_x_continuous(breaks = seq(1980, 2008, 4)) + # redraw x-axis breaks to match presidential election cycle
    scale_color_discrete(name = "Type", labels = c("Adj. VAP", "ANES", "VAP", "VEP")) +
    labs(x = "Year", y = "Turnout Rate (%)") +
    theme_bw()
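
The question asks for the adjustment in 2008 specifically, using the overseas ballots counted (osvoters). A minimal sketch of that formula follows. The helper adjusted_vap_rate() is a name introduced here, and the five rows are retyped from the head() output above, where osvoters is missing, so the result shows why the exact adjustment cannot be computed for those years.

```r
# Adjusted VAP turnout rate: remove overseas ballots from the numerator and
# ineligible felons and noncitizens from the denominator
adjusted_vap_rate <- function(total, ballots_overseas, VAP, felons, noncit) {
  (total - ballots_overseas) / (VAP - felons - noncit) * 100
}

# First five rows of the data; `osvoters` is NA in all of them
first5 <- data.frame(
  year     = c(1980, 1982, 1984, 1986, 1988),
  VAP      = c(164445, 166028, 173995, 177922, 181955),
  total    = c(86515, 67616, 92653, 64991, 91595),
  felons   = c(802, 960, 1165, 1367, 1594),
  noncit   = c(5756, 6641, 7482, 8362, 9280),
  osvoters = NA_real_
)

adj <- with(first5, adjusted_vap_rate(total, osvoters, VAP, felons, noncit))
print(adj)  # NA propagates: a missing `osvoters` gives a missing rate
```

With the full data loaded, the same function could be applied to the single row selected by turnout[turnout$year == 2008, ], where osvoters is recorded.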